Abstract

We show that under a linearity condition on the distribution of the predictors,
the coefficient vector in a single-index regression can be estimated with the
same efficiency as in the case when the link function is known. Thus, the
linearity condition seems to substitute for knowing the exact conditional
distribution of the response given the linear combinations of the predictors.



1 Introduction 

1.1 Single-index Regressions 

Consider a continuous univariate response $Y$ and a vector of continuous
predictors $X \in \mathbb{R}^p$. The most general goal of a regression is to
infer about the conditional distribution of $Y \mid X$. In this paper we
consider single-index regressions, in which $Y \mid X$ depends on $X$ through
at most one linear combination $\beta_0^T X$ of the predictors.

Focusing on the mean function $E(Y \mid X)$, Härdle and Stoker (1989) developed
a nonparametric method called average derivative estimation for estimating
$\beta_0$ in the single-index conditional mean $E(Y \mid X) = g(\beta_0^T X)$,
where the mean function $g$ is unknown. Weisberg and Welsh (1994) considered
the case in which $Y \mid X$ follows a generalized linear model, where the
linear coefficient $\beta_0$ and the link function are unknown. Both pairs of
authors gave estimates for $\beta_0$ that are $\sqrt{n}$-consistent.

Yin and Cook (2005) formulated the problem of single-index regressions, in
which the conditional distribution of $Y \mid X$ is completely characterized by
a linear combination $\beta_0^T X$, so there is no loss of information about
$Y$ if we replace $X$ with $\beta_0^T X$. More specifically, we assume that

$$Y \perp\!\!\!\perp X \mid \beta_0^T X, \tag{1}$$

where for identifiability purposes, we require that $\|\beta_0\| = 1$.
Condition (1) is equivalent to the statement that $Y \mid X$ has a conditional
density $\eta_0(y \mid \beta_0^T x)$, where $\eta_0$ is unknown. Single-index
regression is a special case of sufficient dimension reduction when the
dimension of the central subspace (Cook, 1996) is one. It does not require a
pre-specified single-index model.

We show that under the linearity condition (Li and Duan, 1989), for
single-index regressions there exists an adaptive estimate of $\beta_0$ that
attains the same efficiency as the maximum likelihood estimate when the
conditional density $\eta_0$ is completely specified. For example, if the true
model is $Y = g(\beta_0^T X) + \epsilon$, where the link function $g$ and the
density of the error $\epsilon$ are unknown, then $\beta_0$ can be estimated
with the same efficiency as in the case when $g$ and the error distribution
are known.

1.2 Linearity Condition 

Many sufficient dimension reduction methods require the linearity condition:
$E(X \mid \beta_0^T X)$ is a linear function of $\beta_0^T X$ (Li and Duan,
1989). It is used in popular methods like sliced inverse regression (Li, 1991),
sliced average variance estimation (Cook and Weisberg, 1991) and principal
Hessian directions (Li, 1992; Cook, 1998).

The linearity condition holds if the predictor has an elliptically contoured
distribution (Eaton, 1986), so it holds when $X$ has a multivariate normal
distribution. Moreover, Diaconis and Freedman (1984) showed that most
low-dimensional projections of a high-dimensional data cloud are close to being
normal. Hall and Li (1993) argued that the linearity condition holds
approximately when $p$ is large. The linearity condition applies only to the
marginal distribution of the predictors and not to the conditional distribution
of $Y \mid X$ as is common in regression modeling. Consequently, at the stage
of data collection, we might design the experiment so that the distribution of
$X$ will not blatantly violate elliptic symmetry. We can also transform the
predictors to normality, or we can re-weight the data (Cook and Nachtsheim,
1994) to approximate an elliptical distribution.
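
As a rough illustration of what the condition rules out, the following Python
sketch compares a normal predictor with a non-elliptical one. It assumes that
$X$ is already standardized to mean zero and identity covariance, in which case
linearity amounts to $E[(I_p - \beta\beta^T) X \mid \beta^T X] = 0$. The helper
name `linearity_gap`, the binning scheme and the simulated designs are
illustrative choices, not part of the original paper.

```python
import numpy as np

def linearity_gap(X, beta, n_bins=20):
    """Largest absolute binned mean of Q_beta X over quantile bins of beta^T X.

    Assumes X is centered with identity covariance, so that the linearity
    condition reduces to E[Q_beta X | beta^T X] = 0 (values near 0 are good).
    """
    beta = beta / np.linalg.norm(beta)
    t = X @ beta
    resid = X - np.outer(t, beta)  # rows are Q_beta x_i
    edges = np.quantile(t, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, n_bins - 1)
    means = np.array([resid[bins == k].mean(axis=0) for k in range(n_bins)])
    return float(np.abs(means).max())

rng = np.random.default_rng(0)
n, p = 200_000, 5
beta = np.eye(p)[0]

# Multivariate normal predictors: elliptical, so the condition holds.
X_normal = rng.standard_normal((n, p))

# Second coordinate is a centered, scaled square of the first: still mean
# zero with identity covariance, but E[X_2 | X_1] is quadratic in X_1.
X_skewed = X_normal.copy()
X_skewed[:, 1] = (X_normal[:, 0] ** 2 - 1.0) / np.sqrt(2.0)

gap_normal = linearity_gap(X_normal, beta)
gap_skewed = linearity_gap(X_skewed, beta)
print(gap_normal, gap_skewed)
```

The gap for the normal design should be close to zero, while the second design
shows a clear departure in the tail bins even though its covariance matrix is
the identity.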

1.3 Adaptive Estimation 

The problem of adaptive estimation was introduced by Stein (1956). One wishes
to estimate a Euclidean parameter $\theta$ in the presence of an
infinite-dimensional shape parameter $G$, usually the density. An adaptive
estimate performs asymptotically as well with $G$ unknown as the maximum
likelihood estimate does when $G$ is known (Bickel, 1982). A general method of
constructing adaptive estimates was proposed by Bickel (1982). Schick (1986,
1993) generalized and improved Bickel's method.

It has been shown that adaptive estimation is possible in the symmetric
location problem, in which we need to estimate the center of symmetry of an
unknown distribution (Stone, 1975). It is also possible in linear regressions
where the error density is symmetric and unknown and we need to estimate the
linear coefficient (Bickel, 1982). When the observations are not independent,
Koul and Pflug (1990), Schick (1993), and Koul and Schick (1996) showed that
adaptive estimation is possible in certain autoregressive models. Early
literature on adaptive estimation generally focused on these models and their
generalizations. In this paper we show that under the linearity condition,
adaptive estimation is also possible for single-index regressions.

2 Main Results 

Without loss of generality we assume that $X$ has mean zero and covariance
$I_p$. We also assume that $\beta_0 \in \Theta$, where

$$\Theta = \{\beta \in \mathbb{R}^p : \|\beta\| = 1\}.$$

Let $l(t, y) = (\partial/\partial t)\,\eta_0(y \mid t) / \eta_0(y \mid t)$ be
the derivative of the log density or equivalently the log likelihood. By using
a Lagrange multiplier, the score equation for $\beta_0$ is

$$Q_{\beta_0} E[X\, l(\beta_0^T X, Y)] = 0, \tag{2}$$

where $Q_{\beta} = I_p - P_{\beta}$ and $P_{(\cdot)}$ is the orthogonal
projection onto the subspace spanned by the columns of the matrix $(\cdot)$.

It can be shown that (2) holds not only for $l$, but for any
$f(\cdot,\cdot) \in \mathbb{R}$.

Lemma 1. Assume that the linearity condition holds. Assume
$f(\cdot,\cdot) \in \mathbb{R}$. Then $\beta_0$ is a solution of the equation

$$Q_{\beta} E[X f(\beta^T X, Y)] = 0. \tag{3}$$

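Proof. Since $X$ has covariance matrix $I_p$, according to Cook (1998, p. 57),
we have $E[Q_{\beta_0} X \mid \beta_0^T X] = 0$. Therefore

$$
\begin{aligned}
Q_{\beta_0} E[X f(\beta_0^T X, Y)]
 &= E[Q_{\beta_0} X f(\beta_0^T X, Y)] \\
 &= E\{E[Q_{\beta_0} X f(\beta_0^T X, Y) \mid \beta_0^T X]\} \\
 &= E\{E[Q_{\beta_0} X \mid \beta_0^T X]\, E[f(\beta_0^T X, Y) \mid \beta_0^T X]\} \\
 &= 0,
\end{aligned}
$$

where the third equality uses the conditional independence in (1).

□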

The above lemma shows that a misspecified $l$ still produces a Fisher
consistent estimate of $\beta_0$. According to van der Vaart (1998, Theorem
25.27), Lemma 1 together with some regularity conditions would enable us to
construct an adaptive estimate for $\beta_0$. The regularity conditions are
typically satisfied in practice. A proof of the following theorem is given in
the appendix.
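
Lemma 1 can be checked numerically. The sketch below is only an illustration
under assumed settings: standard normal predictors (so the linearity condition
holds), a hypothetical link $g(t) = e^{t/2}$ with normal errors, and two
arbitrary choices of $f$, neither of which is the true score. The function
name `projected_moment` is ours.

```python
import numpy as np

def projected_moment(X, Y, beta, f):
    """Sample analogue of Q_beta E[X f(beta^T X, Y)] from equation (3)."""
    beta = beta / np.linalg.norm(beta)
    m = X.T @ f(X @ beta, Y) / len(Y)  # (1/n) sum_i X_i f(beta^T X_i, Y_i)
    return m - beta * (beta @ m)       # apply Q_beta = I_p - beta beta^T

rng = np.random.default_rng(1)
n, p = 100_000, 4
beta0 = np.array([1.0, 2.0, -1.0, 0.5])
beta0 /= np.linalg.norm(beta0)

X = rng.standard_normal((n, p))  # mean zero, covariance I_p
Y = np.exp(0.5 * (X @ beta0)) + 0.5 * rng.standard_normal(n)

f_ols = lambda t, y: y                # corresponds to ordinary least squares
f_tanh = lambda t, y: np.tanh(y - t)  # an arbitrary bounded choice

res_ols = np.linalg.norm(projected_moment(X, Y, beta0, f_ols))
res_tanh = np.linalg.norm(projected_moment(X, Y, beta0, f_tanh))
# At a wrong direction the projected moment is not close to zero.
res_wrong = np.linalg.norm(projected_moment(X, Y, np.eye(p)[0], f_ols))
print(res_ols, res_tanh, res_wrong)
```

Both residuals at $\beta_0$ should be of the order of the sampling error,
whereas the residual at the wrong direction stays bounded away from zero.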

Theorem 1. Assume that the Fisher information
$\mathcal{I}(\beta) = E_{\beta}[X X^T l^2(\beta^T X, Y)]$ is finite,
nonsingular and differentiable with respect to $\beta$ in a neighborhood of
$\beta_0$. Let $\hat{l}_n(t, y)$ be an estimate of $l(t, y)$ that satisfies

$$E_{\beta_0}\big[\|X\|^2 \big(\hat{l}_n(\beta_0^T X, Y) - l(\beta_0^T X, Y)\big)^2\big] = o_p(1). \tag{4}$$

Then under the linearity condition we can construct an adaptive estimate of
$\beta_0$ in (1) based on $\hat{l}_n(t, y)$.

Following van der Vaart (1998, p. 393), an adaptive estimate can be constructed
in the following way. Suppose $\tilde{\beta}_n$ is a $\sqrt{n}$-consistent
estimate of $\beta_0$. For instance, under the linearity condition
$\tilde{\beta}_n$ can be chosen as the ordinary least squares estimator (Li and
Duan, 1989). Let $\Gamma_n$ be a $p \times (p-1)$ matrix such that
$(\Gamma_n, \tilde{\beta}_n)$ is an orthogonal matrix. Let

$$\hat{\mathcal{I}}_n = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T\, \hat{l}_n^2(\tilde{\beta}_n^T X_i, Y_i)$$

be an estimator of the information matrix for $\beta$. Let $\hat{\beta}_n$ be
a one-step iteration of the Newton-Raphson algorithm for solving the equation

$$Q_{\beta} \sum_{i=1}^{n} [X_i\, \hat{l}_n(\beta^T X_i, Y_i)] = 0$$

with respect to $\beta$ on the manifold $\Theta$, starting at the initial guess
$\tilde{\beta}_n$. We can write $\hat{\beta}_n$ as

$$\hat{\beta}_n = \tilde{\beta}_n + \frac{1}{n}\, \Gamma_n [\Gamma_n^T \hat{\mathcal{I}}_n \Gamma_n]^{-1} \Gamma_n^T \sum_{i=1}^{n} [X_i\, \hat{l}_n(\tilde{\beta}_n^T X_i, Y_i)]. \tag{5}$$

Van der Vaart (1998, Theorem 25.27) showed that, by using discretization and
sample-splitting devices, $\hat{\beta}_n$ is an adaptive estimate of $\beta_0$
if $\hat{l}_n$ satisfies (4). One such $\hat{l}_n$ based on the kernel density
estimation in Härdle and Stoker (1989) is constructed in the Appendix.
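
The update in (5) can be sketched in a few lines of Python. This is a
simplified illustration, not the full construction: the score is passed in as
a function `score(t, y)` instead of being estimated by kernels, and the
discretization and sample-splitting devices are omitted. The function name
`one_step`, the Gaussian linear model and the perturbed starting value are all
assumptions made for the example.

```python
import numpy as np

def one_step(X, Y, beta_init, score):
    """One Newton-Raphson step of the form (5) with a user-supplied score."""
    n, p = X.shape
    b = beta_init / np.linalg.norm(beta_init)
    # In a complete QR factorization of b, the last p-1 columns of Q form an
    # orthonormal basis of the orthogonal complement of b (the matrix Gamma_n).
    q_full, _ = np.linalg.qr(b.reshape(p, 1), mode="complete")
    gamma = q_full[:, 1:]
    l = score(X @ b, Y)
    info = (X * (l ** 2)[:, None]).T @ X / n  # estimate of the information
    s = X.T @ l / n                           # (1/n) sum_i X_i l(b^T X_i, Y_i)
    step = gamma @ np.linalg.solve(gamma.T @ info @ gamma, gamma.T @ s)
    return b + step

rng = np.random.default_rng(2)
n, p = 20_000, 4
beta0 = np.array([1.0, 2.0, -1.0, 0.5])
beta0 /= np.linalg.norm(beta0)
X = rng.standard_normal((n, p))
Y = X @ beta0 + rng.standard_normal(n)

# For Y = beta0^T X + N(0, 1) noise the true score is l(t, y) = y - t.
gaussian_score = lambda t, y: y - t

beta_init = beta0 + 0.05 * np.array([1.0, -1.0, 1.0, 1.0])
beta_init /= np.linalg.norm(beta_init)
beta_one = one_step(X, Y, beta_init, gaussian_score)

err_init = np.linalg.norm(beta_init - beta0)
err_one = np.linalg.norm(beta_one - beta0)
print(err_init, err_one)
```

In this toy setting a single step should move the deliberately perturbed
starting value much closer to $\beta_0$. In practice the initial value would
come from a $\sqrt{n}$-consistent estimator such as ordinary least squares.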

Since $\hat{\beta}_n$ is an adaptive estimator, it has the same asymptotic
distribution as the maximum likelihood estimator. Next we will derive the
asymptotic distribution of the maximum likelihood estimator. Let
$\hat{\beta}_{mle}$ be the maximum likelihood estimator of $\beta_0$. It is
shown in the appendix that under mild regularity conditions $\hat{\beta}_{mle}$
has the following asymptotic distribution.

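Theorem 2. Assume that the regularity conditions for the asymptotic normality
of the maximum likelihood estimate hold. Then

$$\hat{\beta}_{mle} = \beta_0 + \frac{1}{n}\, \Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^{n} [X_i\, l(\beta_0^T X_i, Y_i)] + o_p(n^{-1/2}), \tag{6}$$

where $\Gamma_0$ is a $p \times (p-1)$ matrix such that $(\Gamma_0, \beta_0)$
is an orthogonal matrix.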

Since $\hat{\beta}_n$ is an adaptive estimator, it has the same asymptotic
distribution as $\hat{\beta}_{mle}$. We conclude that
$\sqrt{n}(\hat{\beta}_n - \beta_0)$ converges to a normal distribution with
zero mean and covariance matrix equal to the covariance matrix of
$\Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T [X\, l(\beta_0^T X, Y)]$.
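
This covariance matrix can be written in closed form. The short derivation
below is an added worked step, not taken from the original text. It uses only
the score equation (2), the orthogonality $\Gamma_0^T \beta_0 = 0$, and the
definition of the Fisher information. The shorthand $A$ is introduced here for
convenience.

```latex
% Shorthand (ours): A = \Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0.
% By (2), E[X l(\beta_0^T X, Y)] lies in the span of \beta_0, and
% \Gamma_0^T \beta_0 = 0, so the projected score has mean zero:
\begin{align*}
E[\Gamma_0^T X\, l(\beta_0^T X, Y)] &= 0, \\
\operatorname{Cov}[\Gamma_0^T X\, l(\beta_0^T X, Y)]
  &= \Gamma_0^T E[X X^T l^2(\beta_0^T X, Y)]\, \Gamma_0
   = \Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0 = A, \\
\operatorname{Cov}\{\Gamma_0 A^{-1} \Gamma_0^T [X\, l(\beta_0^T X, Y)]\}
  &= \Gamma_0 A^{-1} A A^{-1} \Gamma_0^T
   = \Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T .
\end{align*}
```

The limiting covariance is therefore singular in the direction of $\beta_0$,
which reflects the constraint $\|\beta_0\| = 1$.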

3 Discussion 

In this article we showed that under the linearity condition, there exists an
adaptive estimate of the coefficient vector in a single-index regression. From
this result we can see the important role of the linearity condition in
single-index regression, and more generally, in sufficient dimension reduction.
The linearity condition is unusual, as it does not occur commonly outside of
sufficient dimension reduction. We have shown that the linearity condition
asymptotically takes the place of a known density. We conjecture that if the
linearity condition fails, then an adaptive estimate does not exist. In that
case, the coefficient vector could not be estimated as efficiently as it can
be by the maximum likelihood estimator with a known density.

Appendix



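Proof of Theorem 1. Since Lemma 1 holds, according to van der Vaart (1998,
Theorem 25.27), we only need to prove the following two statements.

1. The conditional density $\eta_0(y \mid \beta^T x)$ is differentiable in
quadratic mean with respect to $\beta$ at $\beta_0$.

2. Let $h(\beta^T x, y)$ be the joint density of $\beta^T X$ and $Y$. Then, as
$\beta_n \to \beta_0$,

$$\int \|x\|^2 \Big| l(\beta_n^T x, y) \sqrt{h(\beta_n^T x, y)} - l(\beta_0^T x, y) \sqrt{h(\beta_0^T x, y)} \Big|^2 dx\, dy \to 0.$$

The first statement is true by van der Vaart (1998, Theorem 7.2). So we only
need to prove the second statement. Note that

$$\mathcal{I}(\beta_0) = \int x x^T \Big[ l(\beta_0^T x, y) \sqrt{h(\beta_0^T x, y)} \Big]^2 dx\, dy$$

and

$$\mathcal{I}(\beta_n) = \int x x^T \Big[ l(\beta_n^T x, y) \sqrt{h(\beta_n^T x, y)} \Big]^2 dx\, dy.$$

By the assumptions, $\mathcal{I}(\beta)$ is continuous on a neighborhood of
$\beta_0$ and $\mathcal{I}(\beta_0)$ is finite, so we conclude that
$\mathcal{I}(\beta_n)$ is also finite, hence
$\mathrm{tr}[\mathcal{I}(\beta_0) + \mathcal{I}(\beta_n)] < \infty$. By the
triangle inequality and $(a + b)^2 \le 2(a^2 + b^2)$, the integrand in the
second statement is dominated by an integrable function:

$$\int \|x\|^2 \Big[ |l(\beta_n^T x, y)| \sqrt{h(\beta_n^T x, y)} + |l(\beta_0^T x, y)| \sqrt{h(\beta_0^T x, y)} \Big]^2 dx\, dy \le 2\, \mathrm{tr}[\mathcal{I}(\beta_0) + \mathcal{I}(\beta_n)] < \infty.$$

Then by the dominated convergence theorem,

$$\int \|x\|^2 \Big| l(\beta_n^T x, y) \sqrt{h(\beta_n^T x, y)} - l(\beta_0^T x, y) \sqrt{h(\beta_0^T x, y)} \Big|^2 dx\, dy \to 0.$$

Therefore the second statement is also true.

□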
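
Proof of Theorem 2. We first transform the manifold $\Theta$ to
$\mathbb{R}^{p-1}$ by using the following linear transformation. For any
$\beta \in \Theta$ in a neighborhood of $\beta_0$, let
$\alpha = \varphi(\beta) = \Gamma_0^T \beta$. Then
$\beta = \varphi^{-1}(\alpha) = \Gamma_0 \alpha + (1 - \|\alpha\|^2)^{1/2} \beta_0$,
and $\eta_0(y \mid \beta^T x) = \eta_0(y \mid \varphi^{-1}(\alpha)^T x)$. By
taking the derivative of $\eta_0(y \mid \varphi^{-1}(\alpha)^T x)$ with respect
to $\alpha$, we can derive the asymptotic distribution of the maximum
likelihood estimate for $\alpha$ as follows:

$$\hat{\alpha}_{mle} = \frac{1}{n} [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^{n} [X_i\, l(\beta_0^T X_i, Y_i)] + o_p(n^{-1/2}).$$

By the delta method, we have

$$\hat{\beta}_{mle} = \beta_0 + \frac{1}{n}\, \Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^{n} [X_i\, l(\beta_0^T X_i, Y_i)] + o_p(n^{-1/2}).$$

□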

Construction of $\hat{l}_n$ that satisfies (4). Let $h(\beta_0^T x, y)$ be the
joint density of $(\beta_0^T X, Y)$, and $g(\beta_0^T x)$ be the density of
$\beta_0^T X$; then

$$\eta_0(y \mid \beta_0^T x) = h(\beta_0^T x, y) / g(\beta_0^T x)$$

and

$$l = h'/h - g'/g,$$

where $h'$ and $g'$ are the derivatives of $h$ and $g$ with respect to the
first argument. To estimate $l$, we only need to estimate $h'/h$ and $g'/g$.

We only consider the estimation of $g'/g$ in detail here, because $h'/h$ can be
estimated in the same way, except that the dimension of the density estimation
is different. Let $d$ be the dimension of the density estimation: $d = 1$ for
$g$ and $d = 2$ for $h$.

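Let $T_j = \beta_0^T X_j$. For a fixed twice continuously differentiable
probability density $w$ with compact support, a bandwidth parameter $\sigma$,
and a cut-off tuning parameter $\delta$, set

$$
\begin{aligned}
\hat{g}_n(s) &= \frac{1}{n\sigma} \sum_{j=1}^{n} w\Big(\frac{s - T_j}{\sigma}\Big), \\
\hat{\xi}_n(s) &= \frac{\hat{g}_n'(s)}{\hat{g}_n(s)}\, 1_{\{\hat{g}_n(s) > \delta\}},
\end{aligned}
\tag{7}
$$

where $\hat{\xi}_n(s)$ is our estimator of $g'(s)/g(s)$. Then
$E[(\hat{\xi}_n(\beta_0^T X) - g'(\beta_0^T X)/g(\beta_0^T X))^2 \|X\|^2]$
converges to zero in probability provided $\delta \downarrow 0$ and
$\sigma \downarrow 0$ at appropriate speeds.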
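
Härdle and Stoker (1991, page 992) showed that under some regularity conditions
we have, for any $\epsilon > 0$,

$$\sup_s \big[ |\hat{g}_n(s) - g(s)|\, 1_{\{g(s) > \delta/2\}} \big] = O_p\big[(n^{1-(\epsilon/2)} \sigma^{d})^{-1/2}\big]$$

and

$$\sup_s \big[ |\hat{g}_n'(s) - g'(s)|\, 1_{\{g(s) > \delta/2\}} \big] = O_p\big[(n^{1-(\epsilon/2)} \sigma^{d+2})^{-1/2}\big].$$

Therefore

$$\sup_s \big[ |(\hat{g}_n'/\hat{g}_n) - (g'/g)|\, 1_{\{g > \delta/2\}} \big] = O_p\big[\delta^{-1} (n^{1-(\epsilon/2)} \sigma^{d+2})^{-1/2}\big].$$

Hence for large $n$ we have

$$
\begin{aligned}
E[(\hat{\xi}_n - g'/g)^2 \|X\|^2]
 &= E[(g'/g)^2 \|X\|^2 1_{\{\hat{g}_n \le \delta\}}] + E[((\hat{g}_n'/\hat{g}_n) - (g'/g))^2 \|X\|^2 1_{\{\hat{g}_n > \delta\}}] \\
 &\le E[(g'/g)^2 \|X\|^2 1_{\{g < 2\delta\}}] + E[((\hat{g}_n'/\hat{g}_n) - (g'/g))^2 \|X\|^2 1_{\{g > \delta/2\}}] \\
 &\le E[(g'/g)^2 \|X\|^2 1_{\{g < 2\delta\}}] + O_p\big[\delta^{-2} (n^{1-(\epsilon/2)} \sigma^{d+2})^{-1}\big] \cdot E[\|X\|^2].
\end{aligned}
$$

Assume that $E[\|X\|^2] < \infty$. Since $(g'/g)^2 \|X\|^2 1_{\{g < 2\delta\}}$
is dominated by $(g'/g)^2 \|X\|^2$ and
$E[(g'/g)^2 \|X\|^2 1_{\{g < 2\delta\}}]$ is finite by assumption,
$E[(g'/g)^2 \|X\|^2 1_{\{g < 2\delta\}}]$ converges to zero when $\delta$ goes
to zero. By the assumptions,
$\delta^{-2} (n^{1-(\epsilon/2)} \sigma^{d+2})^{-1}$ also converges to zero,
therefore $E[(\hat{\xi}_n - (g'/g))^2 \|X\|^2] = o_p(1)$.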

In the same fashion we can construct $\hat{\zeta}_n(s, y)$ to estimate $h'/h$,
except that we use $(T_j, Y_j)$ as observations. Then an estimator for $l$ can
be defined as

$$\hat{l}_n = \hat{\zeta}_n - \hat{\xi}_n. \tag{8}$$

Since $E[(\hat{\xi}_n - g'/g)^2 \|X\|^2]$ and
$E[(\hat{\zeta}_n - h'/h)^2 \|X\|^2]$ converge to zero in probability, we have
that $E[(\hat{l}_n(\beta_0^T X, Y) - l(\beta_0^T X, Y))^2 \|X\|^2]$ converges
to zero in probability, and (4) is satisfied. □
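
The truncated kernel estimator of $g'/g$ in (7) can be sketched as follows for
the case $d = 1$. The triweight kernel is one possible choice of a twice
continuously differentiable density with compact support. The function name
`truncated_score`, the bandwidth, the cut-off and the sample size are
illustrative assumptions rather than recommendations from the paper.

```python
import numpy as np

def truncated_score(s, T, sigma, delta):
    """Kernel estimate of g'(s)/g(s) as in (7), set to 0 where g_hat <= delta.

    Uses the triweight kernel w(u) = (35/32)(1 - u^2)^3 on [-1, 1], which is
    twice continuously differentiable with compact support.
    """
    s = np.atleast_1d(np.asarray(s, dtype=float))
    u = (s[:, None] - T[None, :]) / sigma
    core = np.where(np.abs(u) < 1.0, 1.0 - u ** 2, 0.0)
    w = (35.0 / 32.0) * core ** 3
    dw = -(105.0 / 16.0) * u * core ** 2  # derivative w'(u)
    g_hat = w.mean(axis=1) / sigma        # kernel density estimate
    dg_hat = dw.mean(axis=1) / sigma ** 2 # its derivative in s
    keep = g_hat > delta
    # Guard the denominator so that truncated points do not divide by zero.
    return np.where(keep, dg_hat / np.where(keep, g_hat, 1.0), 0.0)

rng = np.random.default_rng(3)
T = rng.standard_normal(20_000)  # stand-in for the index values T_j
grid = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

# For a standard normal density, g'(s)/g(s) = -s.
est = truncated_score(grid, T, sigma=0.5, delta=0.01)
far = truncated_score(6.0, T, sigma=0.5, delta=0.01)
print(est, far)
```

On the grid the estimate should track $-s$ up to smoothing bias and sampling
noise, and it is set to zero far in the tail, where the estimated density falls
below the cut-off. The estimator of $h'/h$ would use a bivariate kernel on
$(T_j, Y_j)$ in the same way.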